Introduction

In this research project, I will use several models to predict ambient air concentrations of fine particulate matter (PM2.5) across the United States. "In our data set, the value column indicates the PM2.5 monitor annual average for 2008 in mass of fine particles/volume of air for 876 monitors located in counties across the United States. The units are micrograms of fine particulate matter (PM) that is less than 2.5 micrometers in diameter per cubic meter of air - mass concentration (ug/m3)." These data come from the U.S. Environmental Protection Agency's monitoring network.

Along the way, we will be raising and answering questions about our prediction algorithms.

Important Predictor Variables Summary

Variable – Details
id – Monitor number; the county number precedes the decimal (e.g., 1073.0023 is one of 8 monitors in Jefferson County, 1073)
lat – Latitude of the monitor in degrees
lon – Longitude of the monitor in degrees
state – State where the monitor is located
county – County where the monitor is located
city – City where the monitor is located
CMAQ – Estimated air pollution values from a computational model called Community Multiscale Air Quality (CMAQ), a modeling system that simulates the physics of the atmosphere using chemistry and weather data to predict air pollution
zcta – Zip Code Tabulation Area where the monitor is located; postal zip codes converted into non-overlapping "generalized areal representations" – Data from the 2010 Census
zcta_area – Land area of the zip code area in square meters – Data from the 2010 Census
imp_a500 – Impervious surface measure within a circle with a radius of 500 meters around the monitor; impervious surfaces (roads, concrete, parking lots, buildings) are a measure of development
imp_a1000 – Impervious surface measure within a circle with a radius of 1000 meters around the monitor
log_dist_to_prisec – Natural log of the distance from the monitor to the nearest primary or secondary road (highway or major road)
log_pri_length_5000 – Natural log of the total primary road length in meters within a circle with a radius of 5000 meters around the monitor (highways only)
log_nei_2008_pm25_sum_10000 – Natural log of annual tons of emissions from the major sources database, summed over all sources within a circle with a radius of 10000 meters around the monitor
pov – Percentage of people living in poverty in 2008 in the zcta where the monitor is located
urc2013 – 2013 urban-rural classification of the county where the monitor is located; 1 is completely urban, 6 is completely rural
aod – Aerosol Optical Depth measurement from a NASA satellite; based on how aerosols scatter light, used as a unit-less proxy for particulate pollution (higher values indicate more pollution) – Data from NASA
# Load the necessary libraries
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(caret)
library(tidycensus)
library(scales)
library(leaflet)

Model 1 - Linear Regression

Linear regression is a fundamental modeling technique and one of the most widely used tools in modern data analysis. It can capture simple to complex patterns in datasets, making it highly versatile. Its traditional use is predicting a continuous response variable. I will use R's lm() function in this project.

\(y = \beta_0 + \beta_1 x + \varepsilon\)
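As a quick illustration of the formula above, here is a minimal sketch of fitting the intercept and slope with lm(). The data are simulated toy values, not the PM2.5 dataset:

```r
# Toy sketch: recover b0 = 2 and b1 = 0.5 from simulated data with lm()
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.3)  # known coefficients plus noise
fit <- lm(y ~ x)
coef(fit)  # intercept near 2, slope near 0.5
```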

Model 2 - kNN (k-Nearest-Neighbors)

kNN is a non-parametric algorithm, meaning it doesn't make assumptions about the underlying data distribution. Instead of learning a model from the training data, kNN directly uses the entire training dataset as its "model." To make a prediction for a new data point, the algorithm finds the closest neighbors by similarity (Euclidean distance) and, for regression, averages their response values. Since the algorithm is more flexible than linear regression, it requires tuning a parameter: the number of neighbors, k.
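The mechanics can be sketched in a few lines of base R (an illustration only; the models below use the kknn engine through tidymodels, and knn_predict and the toy data here are hypothetical names for this sketch):

```r
# kNN regression by hand: Euclidean distances, then average the k nearest responses
knn_predict <- function(train_x, train_y, new_x, k) {
  diffs <- sweep(train_x, 2, new_x)  # subtract the new point from each row
  d <- sqrt(rowSums(diffs^2))        # Euclidean distance to each training point
  mean(train_y[order(d)[1:k]])       # average of the k closest neighbors
}
train_x <- cbind(c(0, 1, 2, 10), c(0, 1, 2, 10))
train_y <- c(5, 6, 7, 100)
knn_predict(train_x, train_y, c(1, 1), k = 3)  # averages 5, 6, 7 -> 6
```

Note how the distant outlier (y = 100) is ignored entirely because it is not among the k nearest neighbors.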

Model 3 - Random Forest

The Random Forest algorithm builds decision trees on different samples of the training data, and then combines them through a process called "bagging." Bagging involves creating multiple subsets of the training data with replacement, and training a decision tree on each subset. For regression, the final prediction is the average of the predictions from the decision trees. By combining the predictions of multiple decision trees, Random Forests can improve the accuracy and robustness of the overall model. I will use the ranger engine for this method.

Hyper-parameters for tuning: min_n sets the minimum node size, and mtry sets the number of predictors that are randomly selected at each split in the decision tree.
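To make the bagging idea concrete, here is a toy sketch on simulated data, with a simple linear fit standing in for each decision tree: bootstrap resamples of the training data, one weak model per resample, and predictions averaged (Random Forest additionally samples mtry predictors at each split):

```r
# Bagging sketch: average predictions from models fit on bootstrap resamples
set.seed(42)
n <- 50
dat_toy <- data.frame(x = runif(n))
dat_toy$y <- 3 * dat_toy$x + rnorm(n, sd = 0.5)
preds <- replicate(200, {
  idx <- sample(n, replace = TRUE)          # bootstrap sample (with replacement)
  fit <- lm(y ~ x, data = dat_toy[idx, ])   # one weak learner per resample
  predict(fit, newdata = data.frame(x = 0.5))
})
mean(preds)  # bagged prediction at x = 0.5, near the true value 1.5
```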

Model 4 - Boosted Trees

In the fourth model, I used the boosted trees algorithm. Boosting is a technique that combines multiple weak decision trees to generate a final output: each weak model can have poor performance on its own, but combined in sufficient numbers they can produce strong predictions. I will be using the xgboost engine, which implements gradient boosting; the intuition is that the best possible next model is the one that minimizes the overall prediction error when combined with the previous models. This also requires tuning hyper-parameters:

trees sets the maximum number of decision trees that will be fitted to the data; tree_depth sets the maximum depth of each decision tree in the model; min_n sets the minimum number of samples per leaf; and learn_rate controls the contribution of each tree to the final prediction.
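The gradient-boosting intuition above can be sketched with one-split regression stumps on toy data (an illustration of the idea, not what xgboost does internally): each round fits the current residuals, and its contribution is shrunk by learn_rate before joining the ensemble.

```r
# Boosting sketch: repeatedly fit residuals with a one-split stump, shrink, add
set.seed(7)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.1)
pred <- rep(mean(y), n)             # round 0: predict the mean
learn_rate <- 0.1
splits <- seq(0.1, 0.9, by = 0.1)   # candidate stump thresholds
for (round in 1:200) {
  r <- y - pred                     # residuals of the current ensemble
  sse <- sapply(splits, function(s) {
    left <- x < s
    sum((r - ifelse(left, mean(r[left]), mean(r[!left])))^2)
  })
  left <- x < splits[which.min(sse)]  # best single split this round
  stump <- ifelse(left, mean(r[left]), mean(r[!left]))
  pred <- pred + learn_rate * stump   # shrunken contribution of each "tree"
}
sqrt(mean((y - pred)^2))  # training RMSE, far below simply predicting the mean
```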

Exploratory analysis

To get a general grasp of how air pollution varies across the US, let’s explore and rearrange the data to make it more understandable.

dim(dat)
## [1] 876  50
head(dat)

In this code chunk, I tried to identify a correlation between high PM2.5 and urban locations with high population density.

# Expect states with more urban areas (urc2013 closer to 1) to show higher PM2.5
dat %>% 
  group_by(urc2013) %>%
  summarize(meanURC = mean(value))
dat %>%
  group_by(state) %>%
  summarize(meanPM2.5 = mean(value), meanURC = mean(urc2013), 
            meanPopulationDensity = mean(popdens_zcta)) %>%
  ggplot(aes(meanPopulationDensity, meanPM2.5, color = meanURC, label = state)) +
  geom_point() + 
  geom_text(hjust=0, vjust=0, size = 2)
PM2.5 in states with different urban/rural classification

dat %>% 
  ggplot(aes(aod, value)) +
  geom_point() + 
  geom_smooth(method = lm) + 
  labs(title = "aod vs observed PM2.5")
AOD performance

AOD tends to overpredict PM2.5.

dat_mean <- dat %>%
  group_by(state) %>%
  summarise(meanPM2.5 = mean(value))

us_state_pop <- get_acs(
  geography = "state", 
  year = 2021,
  variables = c("population" = "B01001_001"), 
  geometry = TRUE)
non_contiguous_regions <- c("Alaska", "Hawaii", "Rhode Island", "Puerto Rico")

us_contiguous <- us_state_pop %>% 
  filter(!NAME %in% non_contiguous_regions)

us_contiguous <- us_contiguous %>%
  rename(state = NAME)

us_contiguous <- us_contiguous %>%
  left_join(dat_mean, by = "state")

pal_states <- colorNumeric("viridis", us_contiguous$meanPM2.5)


label_state <- function(state, meanPM2.5){
  str_glue("{state} with mean PM2.5 of {meanPM2.5}")
}

us_contiguous %>% 
  leaflet() %>% 
  addPolygons(weight = 1,
              color = "white",
              fillColor = ~pal_states(meanPM2.5),
              fillOpacity = 1,
              popup = ~label_state(state, meanPM2.5)) %>% 
  addLegend(pal = pal_states,
            values = ~meanPM2.5,
            opacity = 1)
tx <- dat %>% 
    filter(state == "Texas") 
tx %>% 
    ggplot(aes(lon, lat)) + 
    geom_point(color = "sienna", size = 3) +
    labs(x = "Longitude", y = "Latitude") +  
    coord_fixed() +
    geom_path(aes(long, lat), 
              data = map_data("state") %>% 
                  filter(region == "texas"))

There are only 27 observations in the state of Texas, so we may already be concerned that the data are unrepresentative.

In the following code chunk, I looked for a correlation between the latitude and longitude of the monitors and air pollution in a state by rearranging summary statistics. I identified a small trend indicating that coastal states have, on average, lower mean PM2.5.

dat %>%
  group_by(state) %>%
  summarize(meanLat = mean(lat), meanPM2.5 = mean(value)) %>%
  arrange(meanLat)
dat %>%
  group_by(state) %>%
  summarize(meanLon = mean(lon), meanPM2.5 = mean(value)) %>%
  arrange(meanLon)

Arriving at optimal predictors

I identified predictors that were useful in a linear regression and used those predictors in the other models as well. I plotted the relationship between log_nei_2008_pm25_sum_25000 and value in the visualization, which convinced me that these predictors could produce the best prediction model.

# Select and scale numeric variables
dat_withoutCategorical <- dat %>%
  select(where(is.numeric)) %>%
  mutate(across(everything(), scale))

# Create a linear fit and obtain coefficients
lin <- lm(value ~ ., data = dat_withoutCategorical)

summary(lin) 

#examine the linear relationship between one of the predictors
dat %>%
  ggplot(aes(log_nei_2008_pm25_sum_25000, value)) +
  geom_point() + 
  geom_smooth(method = lm) +
  labs(y = "PM2.5 (ug/mˆ3)")
Tons of emissions from major sources vs PM2.5

Here, I identified that the highest correlations existed between value and the variables zcta_area, imp_a500, imp_a1000, and log_nei_2008_pm25_sum_25000, with coefficients of -8.248e-02, -8.342e-02, 9.767e-02, and 8.097e-01, respectively.

zcta_area - Land area of the zip code area in meters squared – Data from the 2010 Census

imp_a500 - Impervious surface measure – Within a circle with a radius of 500 meters around the monitor – Impervious surface are roads, concrete, parking

imp_a1000- Impervious surface measure – Within a circle with a radius of 1000 meters around the monitor – Impervious surface are roads, concrete, parking

log_nei_2008_pm25_sum_25000 - Tons of emissions from major sources data base (annual data) sum of all sources within a circle with a radius of 25000 meters of distance around the monitor (Natural log)

Using all the variables as predictors in a model can potentially lead to over-fitting, where the model fits too closely to the training data and performs poorly on new data. Therefore, I would expect the RMSE for our model restricted to the top four predictors to be smaller, since it lowers complexity.
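RMSE, the metric used to compare models below, is simply the square root of the mean squared prediction error; a tiny helper makes the definition explicit (yardstick computes the same quantity inside the tidymodels pipeline):

```r
# RMSE: square root of the mean squared difference between observed and predicted
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(c(10, 12, 9), c(11, 11, 9))  # sqrt((1 + 1 + 0) / 3), about 0.816
```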


Results

Process:

# Split dat into training and testing data
dat_split <- initial_split(dat)
dat_train <- training(dat_split)
# Start with linear regression, specify recipe with outcome and predictor
rec <- dat_train %>%
    recipe(value ~ imp_a1000 + imp_a500 + zcta_area + log_nei_2008_pm25_sum_25000) %>%
    step_normalize(imp_a1000, imp_a500, zcta_area, log_nei_2008_pm25_sum_25000) 
  
# Create a linear regression model with regression mode
model <- linear_reg() %>% 
    set_engine("lm") %>% 
    set_mode("regression")
# Create workflow
wf <- workflow() %>%
    add_recipe(rec) %>% 
    add_model(model)

res <- fit(wf, data = dat_train)
# Check performance on the complete training data
res %>% 
    extract_fit_engine() %>% 
    summary() 
# Perform cross validation 
folds <- vfold_cv(dat_train, v = 10)

res <- fit_resamples(wf, resamples = folds)

res %>% 
    collect_metrics()
# Fit final model to entire training set; evaluate on test set
final <- wf %>% 
    last_fit(split = dat_split)

final %>% 
    collect_metrics()
# Plot the observed model predictions vs PM2.5 values 
final %>% 
    collect_predictions() %>% 
    ggplot(aes(.pred, value)) +
    geom_point() + 
    geom_abline(intercept = 0, slope = 1) +
    labs(title = "Linear Regression Predicted vs Value", y = "PM2.5 (ug/mˆ3)")
Linear Regression Model

# Replicate this process across other models 
# Add principal component analysis feature extraction
rec <- dat_train %>%
    recipe(value ~ imp_a1000 + imp_a500 + zcta_area + log_nei_2008_pm25_sum_25000) %>%
    step_normalize(imp_a1000, imp_a500, zcta_area, log_nei_2008_pm25_sum_25000) %>%
    step_pca(all_predictors(), num_comp = 3)

# Use kNN algorithm and tune for the optimal number of neighbors
model <- nearest_neighbor(neighbors = tune("k")) %>% 
    set_engine("kknn") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_model(model) %>% 
    add_recipe(rec)
# cross-validation with 10 folds
folds <- vfold_cv(dat_train, v = 10)

res <- tune_grid(wf, resamples = folds,
                 grid = tibble(k = c(10, 15, 20, 25, 30)))

res %>% 
    collect_metrics() %>% 
    filter(.metric == "rmse") %>% 
    ggplot(aes(k, mean)) +
    geom_point() + 
    geom_line()
kNN Model

# Refit with k = 30, the best-performing number of neighbors from tuning
model <- nearest_neighbor(neighbors = 30) %>% 
    set_engine("kknn") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_model(model) %>% 
    add_recipe(rec)

folds <- vfold_cv(dat_train, v = 10)

res <- fit_resamples(wf, resamples = folds)

# Fit final model to entire training set; evaluate on test set
finalBest <- wf %>% 
    last_fit(split = dat_split)

finalBest %>% 
    collect_metrics()

# Plot the observed model predictions vs PM2.5 values 
finalBest  %>% 
    collect_predictions() %>% 
    ggplot(aes(.pred, value)) +
    geom_point() + 
    geom_abline(intercept = 0, slope = 1) + 
    labs(title = "kNN Predicted vs Value", y = "PM2.5 (ug/mˆ3)")
kNN Model

# try random forest
rec <- dat_train %>%
    recipe(value ~ imp_a1000 + imp_a500 + zcta_area + log_nei_2008_pm25_sum_25000) %>%
    step_normalize(imp_a1000, imp_a500, zcta_area, log_nei_2008_pm25_sum_25000)
# Try a grid of tuning parameters
model <- rand_forest(mtry = tune("mtry"),
                     min_n = tune("min_n")) %>% 
    set_engine("ranger") %>% 
    set_mode("regression")
# Set workflow 
wf <- workflow() %>% 
    add_recipe(rec) %>% 
    add_model(model)
# Fit the model over a grid of tuning parameters. For "mtry," try values 
# between the square root of the number of predictors and the total number 
# of predictors; for "min_n," try a range of values (5 to 50 here).
folds <- vfold_cv(dat_train, v = 10)

res <- tune_grid(wf, resamples = folds, 
                 grid = expand.grid(mtry = c(2, 3, 4),
                                    min_n = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)))

# Fit the best model obtained from tuning
best_params <- select_best(res, metric = "rmse")

model <- rand_forest(mtry = best_params$mtry,
                     min_n = best_params$min_n) %>% 
    set_engine("ranger") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_recipe(rec) %>% 
    add_model(model)

# Fit final model to entire training set; evaluate on test set
final <- wf %>% 
    last_fit(split = dat_split)

final %>% 
  collect_metrics()

# Plot the observed PM2.5 values vs. model predictions
final %>% 
    collect_predictions() %>% 
    ggplot(aes(.pred, value)) +
    geom_point() + 
    geom_abline(intercept = 0, slope = 1) +
    labs(title = "Random Forest Algorithm Predicted vs Value", y = "PM2.5 (ug/mˆ3)")
Random Forest Model

rec <- dat_train %>%
    recipe(value ~ imp_a1000 + imp_a500 + zcta_area + log_nei_2008_pm25_sum_25000) %>%
    step_normalize(imp_a1000, imp_a500, zcta_area, log_nei_2008_pm25_sum_25000)

model <- boost_tree(trees = tune("trees"),
                    tree_depth = tune("tree_depth"),
                    min_n = tune("min_n"),
                    learn_rate = tune("learn_rate")) %>% 
    set_engine("xgboost") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_recipe(rec) %>% 
    add_model(model)

# Define tuning grid
grid <- grid_latin_hypercube(trees(), tree_depth(), min_n(), learn_rate(), size = 10)
res <- tune_grid(wf, resamples = vfold_cv(dat_train, v = 5), 
                grid = grid, control = control_grid(verbose = TRUE))

# Get best model
best_params <- select_best(res, metric = "rmse")

# Finalize model with best hyper-parameters
model <- boost_tree(trees = best_params$trees,
                    tree_depth = best_params$tree_depth,
                    min_n = best_params$min_n,
                    learn_rate = best_params$learn_rate) %>% 
    set_engine("xgboost") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_recipe(rec) %>% 
    add_model(model)

#Fit on testing data 
final <- wf %>% 
    last_fit(split = dat_split)

final %>% 
  collect_metrics()
# Plot the observed PM2.5 values vs. model predictions
final %>% 
    collect_predictions() %>% 
    ggplot(aes(.pred, value)) +
    geom_point() + 
    geom_abline(intercept = 0, slope = 1) + 
    labs(title = "Boosted Trees Algorithm Predicted vs Value", y = "PM2.5 (ug/mˆ3)")
Boosted Trees Model

Below is a summary table of prediction metrics on the same testing data set. Comparing the RMSE of the four models, I concluded that the kNN algorithm performed best, followed by linear regression, random forest, and boosted trees.

##               Model    Metric     Value
## 1 Linear Regression      RMSE 1.9805512
## 2 Linear Regression R-squared 0.2096041
## 3               kNN      RMSE 1.9567116
## 4               kNN R-squared 0.2200484
## 5     Random Forest      RMSE 1.9848881
## 6     Random Forest R-squared 0.2080064
## 7     Boosted Trees      RMSE 2.0240379
## 8     Boosted Trees R-squared 0.1995826
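Ranking the table above by RMSE (values copied from the printed output) confirms this ordering:

```r
# Rank the four models by test-set RMSE (numbers taken from the table above)
metrics <- data.frame(
  model = c("Linear Regression", "kNN", "Random Forest", "Boosted Trees"),
  rmse  = c(1.9805512, 1.9567116, 1.9848881, 2.0240379)
)
metrics[order(metrics$rmse), ]  # kNN first, boosted trees last
```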

Questions

  1. Based on test set performance, at what locations does your model give predictions that are closest and furthest from the observed values? What do you hypothesize are the reasons for the good or bad performance at these locations?

The locations Bakersfield, California; Nogales, Arizona; and Libby, Montana were the worst-performing locations for our model. They skew toward the rural end of the urban-rural classification (urc2013). The best performers were Ventura, California (60 miles northwest of Los Angeles); St. Paul, Minnesota; and Milwaukee, Wisconsin. This is not surprising, as they are in the middle of the pack: relatively average-sized cities with proximity to other air pollution sources.

#worst performing locations
finalBest %>%
  collect_predictions() %>%
  mutate(difference = value - .pred) %>%
  arrange(desc(abs(difference)))
# top 3 worst performers
dat[76,]
dat[39,]
dat[502,]
#best performing locations
finalBest %>%
  collect_predictions() %>%
  mutate(difference = value - .pred) %>%
  arrange(abs(difference))
# top 3 performers
dat[141,]
dat[459,]
dat[855,]
  2. What variables might predict where your model performs well or not? For example, are there regions of the country where the model does better or worse? Are there variables not included in this dataset that you think might improve the model performance if they were included in your model?

After examining the rows that had the greatest difference between predicted and observed PM2.5, I found that the worst-performing locations were all concentrated in the western part of the country, namely in the states of California, Idaho, Montana, and Arizona.

Counties in these states tend to be much larger than the average county in the United States, which could skew the data. For example, Clark County, Nevada (which contains Las Vegas) is very large in area, which could have led our model to predict lower PM2.5 values than observed, even though its population density is surprisingly high. The model predictions that were closest to the observed values were dispersed throughout the country with no identifiable pattern. However, these locations tended to have higher PM2.5 values, which could suggest that our model under-predicts PM2.5 for these locations.

Categorical data might predict whether our model performs well, since locations grouped together (especially by city) generally share similar features. Since air pollution stems mostly from household combustion devices, motor vehicles, and industrial facilities, variables that most directly reflect these sources could predict PM2.5 more accurately. For example, a variable counting registered active vehicles in a particular area could be more informative than the amount of roadway: impervious surfaces can be constructed without contributing much to emissions (imp_a1000), and tons of emissions from major sources might not include vehicles (log_nei_2008_pm10_sum_25000).

  3. There is interest in developing more cost-effective approaches to monitoring air pollution on the ground. Two candidates for replacing the use of ground-based monitors are numerical models like CMAQ and satellite-based observations such as AOD. How well do CMAQ and AOD predict ground-level concentrations of PM2.5? How does the prediction performance of your model change when CMAQ or AOD are included (or not included) in the model?

The model performs better when the predictors CMAQ and AOD are added, which suggests that AOD and CMAQ are effective for monitoring air pollution. Without the CMAQ predictor, the RMSE for our kNN algorithm was 1.9567116; with it as a normalized predictor, the RMSE decreased to 1.7626403. I expected this after originally identifying a linear correlation between CMAQ and the observed PM2.5 value. In Figure 9, I also saw that the predicted and observed values had a much tighter linear correlation. The same holds for AOD: when the model included AOD as a predictor, the final RMSE was 1.8692177, which is modestly better than 1.9567116.

kNN Model with CMAQ

  4. The dataset here did not include data from Alaska or Hawaii. Do you think your model will perform well or not in those two states? Explain your reasoning.

It's possible that a model trained on air quality data from the continental US will be inaccurate when predicting air quality in Alaska and Hawaii, because both states differ substantially from the continental United States in their demographics and geography. However, the data set and predictors are still fairly robust and might yield accurate predictions. For example, the zcta_area predictor should capture the inverse relationship between land area and air pollution; I expect the model to perform reasonably well in Alaska in particular, given its large land area and relative lack of highways.

Reflection

Examining Figures 4 - 8, it is evident that the kNN model produced predictions closest to the observed values. I confirmed my hypothesis that the highest correlation would be between the PM2.5 value and the monitor's proximity to highways or surfaces that vehicles can traverse, since other predictors led to higher RMSE. According to outside studies, emissions from the combustion of gasoline, oil, diesel fuel, or wood produce much of the PM2.5 in the air, which also supports this finding. On the other hand, I was surprised that zcta_area was another strong predictor, since a state can have a very large zip-code area without being urbanized.

The most challenging aspect of this project was understanding how to choose and use predictors using correlation analysis and tuning the various models, especially boosted trees. It was important to use the entire dataset to find the highest absolute value coefficients for predictors, since training data was only a percentage of the data and would produce different coefficients. I initially used training data to determine predictors using a linear fit, which was a mistake; I realized this would change the best predictors each time I ran the code since the training data was a randomized percentage of the original dataset. It was also important to use the same training data to compare the RMSE on different models. I learned about the various differences in using modeling algorithms, including the usage of decision trees. The recipe was applicable to the various models and was advantageous because it provided a way to develop models that were concise and reproducible and made it easier to share my work with others.

Reflection on performance of models & Limitations:

I expected boosted trees to perform the best (better than random forest and kNN), but for several possible reasons it was the worst prediction model. Boosted trees is a more advanced algorithm, but it can be prone to overfitting on datasets of this size. There is also the issue of hyper-parameters: the RMSE for boosted trees decreased as I increased the number of cross-validation folds and the number of hyper-parameter combinations sampled (size), but this came at the cost of run time (the algorithm took more than a few minutes to run). Similarly, I was surprised that kNN performed the best, but this could be explained by the relatively simple, near-linear relationships between the predictors and value, and by the relatively small dataset I worked with. I also understood the kNN algorithm more deeply and used Principal Component Analysis with it, whereas I was less familiar with boosted trees and random forest, so I may not have fully understood the tuning and optimization methods behind those algorithms.